Abstract - In recent years, unspoken words recognition has
received substantial attention from both the scientific research
communities and the society of multimedia information access
networks. Major advancements and a wide range of applications
have propelled words recognition technology into the spotlight:
aids for the speech handicapped, speech pathology research,
telecom privacy, cursor-based text-to-speech, firefighters
wearing pressurized suits with self-contained breathing
apparatus (SCBA), astronauts performing operations in
pressurized gear, and communication systems operating in high
background noise. Although early words recognition techniques
used only simple maximum likelihood algorithms, the recognition
process has now matured into a science of mathematical
representations and comparison processes. This survey paper
provides an up-to-date review of the existing approaches and
offers some insights into the study of unspoken words
recognition. A number of typical techniques and EMG-based
approaches are discussed. The incentives for using recognition
techniques, the applications of this technology, and some of
the difficulties plaguing current systems are also discussed.

Keywords- Speech Pathologies, Electromyography, Hidden 
Markov Models, Myoelectric Signals (MES). 

I. Introduction 

Unspoken words recognition relates to the task of enabling
speech communication in the absence of an acoustic signal. In
this technique, data is acquired from elements of the human
speech production process such as the articulators, their
neural pathways, or the brain itself. It produces a digital
representation of speech which can be synthesized directly,
interpreted as data, or fed into a communication network.
Persons who have undergone a laryngectomy, or older citizens
for whom speaking requires a substantial effort, would be able
to mouth words rather than actually pronouncing them; unspoken
words recognition has proved to be an aid for this purpose [2].
Alternatively, those unable to move their articulators due to
paralysis could produce speech or issue commands simply by
concentrating on the words to be spoken. Further, as SSIs
(Silent Speech Interfaces) are built upon the existing human
speech production process, augmented with digital sensors and
processing, they have the potential to be more natural
sounding, spontaneous, and intuitive to use than currently
available speech pathology solutions such as the electrolarynx,
tracheo-oesophageal speech (TOS), and cursor-based
text-to-speech systems [1].

With reference to the block diagram of Fig. 1, the facial EMG
signal is acquired from the speech articulators responsible for
producing speech sounds, using the data acquisition techniques
summarized in Table I. The acquired signal is then normalized,
and activity detection is performed on each signal [4].
Features are extracted from the normalized signal [6], [7].
After feature extraction, data analysis and feature selection
are carried out [9], [10]. The selected features are fed into
different types of classifiers [12], [14], and spoken and
unspoken speech are compared [23], [24].
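
The first two stages of this pipeline can be illustrated with a
minimal sketch. It assumes z-score normalization and a simple
frame-energy threshold for activity detection; neither choice
is taken from the cited works, and the function names, frame
length and threshold are hypothetical.

```python
import math


def normalize(signal):
    """Z-score normalization: zero mean, unit variance (assumed scheme)."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a flat signal
    return [(x - mean) / std for x in signal]


def detect_activity(signal, frame_len, threshold):
    """Return (start, end) sample ranges whose mean frame energy
    exceeds the threshold. Trailing samples that do not fill a
    whole frame are ignored."""
    segments = []
    start = None
    n_frames = len(signal) // frame_len
    for k in range(n_frames):
        begin = k * frame_len
        frame = signal[begin:begin + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = begin  # activity begins at this frame
        elif start is not None:
            segments.append((start, begin))  # activity ended before this frame
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments


# Synthetic example: silence, a burst of activity, silence.
raw = [0.0] * 40 + [1.0, -1.0] * 20 + [0.0] * 40
active = detect_activity(normalize(raw), frame_len=10, threshold=0.5)
```

In practice the threshold would have to be tuned per subject
and per electrode site, since EMG amplitude varies widely.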

This article comprises an introduction, a generalized block
diagram, a comprehensive analysis of the references that
distinguishes between different SEMG (surface electromyography)
techniques, and conclusions and future perspectives.

II. Related Work 

Unspoken words recognition is a medium of speech communication
that does not use the sound produced when people vocalize. It
is a type of electronic reading of the facial muscles by a
computer, which identifies the phonemes and words that an
individual attempts to pronounce from non-auditory sources of
information about their speech movements. If the input to such
a computer-based system is plain text, that is, it contains no
additional phonemes, the information system is called a
text-to-speech system. Accurate articulatory speech synthesis
is possible by creating a synthetic model of human physiology.
Experimental systems have evolved seven different types of
technology for acquiring the speech signal [2]. Substantial
improvements in word recognition have led to the development of
additional sensors, such as throat microphones, which are used
as part of multimodal speech recognition. Experimental studies
have been accompanied by growth in computing power and
miniaturization of electronics, which reduces the impact of
noise [3]. The application field of SSI comprises assistance to
persons who have undergone laryngectomy, as an alternative to
the electrolarynx, to oesophageal speech and to
tracheo-oesophageal speech. Owing to its non-invasive nature
and good time resolution, SEMG is well adopted in clinical
imaging and analysis [4].

EMG-based technology has the interesting property that speech
can be detected even when little or no acoustic energy is
produced. EMG activity can be detected even when the subject
whispers or moves the mouth without producing sound [7], [13].
The EMG-to-speech approach is preferable in human-to-human
communication, particularly when the vocabulary is unrestricted
and a direct mapping from the EMG signal to the required speech
content is allowed [15], [16]. An EMG-based automatic speech
recognition system is inherently robust to high background
noise, because the EMG electrodes measure muscle activity
through the skin tissue rather than a signal transmitted
through the air; it therefore allows confidential input in
public places and provides robustness to ambient noise [17].
EMG measures the biopotential, in terms of the electric current
generated by a muscle during its contraction, which describes
its neuromuscular activity; these muscle movements are then
translated into a speech signal [18].

Figure 1: Generalized block diagram of unspoken word recognition

In the past two decades, SEMG has attracted more and more
applications in rehabilitation and human-computer interfaces
[18]. Potential applications of SEMG include facial motion
disorders (for example, stroke patients who may have swallowing
disorders) and paralytic patients who suffer from illness or
accident; many prosthetic devices are also controlled by SEMG
[20], [21]. Therefore, SEMG provides a valuable reference in
clinical diagnosis and biomedical applications. A special
neural feature of electromyography is that movements are not
only voluntary but also under emotional control, which leads to
the hypothesis that these voluntary controls are mediated by an
integrated network [22].

Table I. Comprehensive analysis of references

Authors: Robin Hofe et al. [5]
Data collection specification methods and sampling frequency:
  Magnetic sensing by employing electromagnetic articulography.
  Sampling frequency: 50 Hz to 500 Hz.
Methods of feature extraction/selection: Acoustic signal is
  reconstructed using Mel Frequency Cepstral Coefficients
  (MFCC). HMM.
Classification technique: Viterbi alignment algorithm.
Efficiency/Efficacy: The influence of additional noise can be
  mitigated by employing a set of additional pellets.

Authors: J. M. Gilbert et al. [6]
Data collection specification methods and sampling frequency:
  Magnetic sensors are placed on the tongue and lips of the
  subject and record the movement of the two.
  Sampling frequency: 500-1000 samples/sec.
Methods of feature extraction/selection: Wavelet Transform (WT).
Classification technique: Statistical modeling technique.
Efficiency/Efficacy: A very high recognition rate can be
  achieved with reduced sets of implants to achieve better
  performance and preserve the acoustic signal from distortion.

Authors: Bradley J. Betts et al. [7]
Data collection specification methods and sampling frequency:
  Facial EMG is captured using Ag/AgCl electrodes as the
  sensing device. Sampling frequency: 10000 samples/sec.
Methods of feature extraction/selection: Maximum likelihood
  algorithm, WT.
Classification technique: Neural network classifier.
Efficiency/Efficacy: Computational requirements need to be
  consistent with typical wearable environments. Real-time
  performance of the system remains to be quantified.

Authors: K. Yoshida et al. [8]
Data collection specification methods and sampling frequency:
  For high recognition performance, an HMM recognition method
  that utilizes acoustic knowledge is used to efficiently treat
  contextual and allophonic variation.
Methods of feature extraction/selection: Hidden Markov model
  (HMM), WT.
Classification technique: Single Gaussian probability density
  function.
Efficiency/Efficacy: For a high recognition rate, a neural
  network classifier can be employed and additional noise can
  be minimized.

Authors: H. Manabe et al. [9]
Data collection specification methods and sampling frequency:
  Three-channel surface electromyography is used for acquiring
  the training data. Sampling frequency: 1000 samples/sec.
Methods of feature extraction/selection: Multi-stream HMM.
Classification technique: Mel frequency cepstral coefficients
  (MFCC).
Efficiency/Efficacy: Recognition accuracy may be improved by
  employing five-channel EMG.

Authors: Erik J. Scheme et al. [10]
Data collection specification methods and sampling frequency:
  Five-channel MES data is collected using Ag/AgCl duotrode
  bipolar electrode pairs placed over the five articulatory
  muscles of the face. Sampling frequency: 5 kHz.
Methods of feature extraction/selection: HMM.
Classification technique: Maximum likelihood output of the
  Viterbi algorithm.
Efficiency/Efficacy: Feasibility and accuracy of the real-time
  acoustic data using its automatic segmentation and expansion
  of phonemes.

Authors: Noboru Sugie et al. [11]
Data collection specification methods and sampling frequency:
  Raw EMG signals are collected using three channels of EMG and
  Ag/AgCl electrodes. Sampling frequency: 1250 samples/channel.
Methods of feature extraction/selection: Linear Discriminant
  Analysis (LDA) for discrimination of vowels.
Classification technique: (no entry)
Efficiency/Efficacy: Coarticulation may cause difficulty, so
  continuous voice production should be included.

Authors: Ki-Seung Lee et al. [12]
Data collection specification methods and sampling frequency:
  A 3-channel EMG with Ag/AgCl surface electrodes is used for
  data acquisition. A Hamming window of 100 msec length is used
  to extract features at 20 msec intervals.
Methods of feature extraction/selection: Continuous hidden
  Markov models, multivariate Gaussian distribution.
Classification technique: Artificial neural network.
Efficiency/Efficacy: For optimal performance, the data
  collection must be large; the recognition rate can be
  improved.

Authors: Quan Zhau et al. [14]
Data collection specification methods and sampling frequency:
  Five-channel MES data is collected using Ag/AgCl duotrode
  bipolar electrode pairs placed over the five muscles of the
  face. Sampling frequency: 10 kHz.
Methods of feature extraction/selection: Principal component
  analysis, MFCC.
Classification technique: Gaussian mixture model.
Efficiency/Efficacy: Automatic myoelectric can be developed to
  make the existing system more robust.

Authors: S. P. Arjunan et al. [19]
Data collection specification methods and sampling frequency:
  SEMG is used to measure the relative activities of four
  facial muscles. A window size of 20 samples corresponds to
  10 msec.
Methods of feature extraction/selection: Root mean square.
Classification technique: Error back-propagation neural
  network.
Efficiency/Efficacy: Future possibilities may include
  speech-based computer control in high background noise.
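
Several entries in Table I describe windowed feature
extraction, for example root mean square (RMS) over windows of
20 samples that correspond to 10 msec, which implies a sampling
rate of 20 / 0.010 s = 2000 samples/sec. The following is a
minimal sketch of that arithmetic and of RMS features over
sliding windows. The function names and the hop length are
assumptions made for illustration, not details taken from the
cited work.

```python
import math


def sampling_rate_hz(window_samples, window_ms):
    """Sampling rate implied by a window of given length in samples and ms."""
    return window_samples * 1000.0 / window_ms


def rms_features(signal, window_len, hop_len):
    """RMS value of each window; windows start every hop_len samples.
    Trailing samples that do not fill a whole window are ignored."""
    feats = []
    for start in range(0, len(signal) - window_len + 1, hop_len):
        window = signal[start:start + window_len]
        feats.append(math.sqrt(sum(x * x for x in window) / window_len))
    return feats


rate = sampling_rate_hz(20, 10)  # 20 samples per 10 msec window
# Non-overlapping 20-sample windows over a two-level test signal.
feats = rms_features([3.0] * 20 + [4.0] * 20, window_len=20, hop_len=20)
```

Choosing a hop length shorter than the window gives overlapping
windows, as in the 100 msec window applied at 20 msec intervals
listed in the table.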

Focusing on existing EMG-based speech recognition systems, the
most practiced approaches are phoneme recognition, vowel
recognition and complete words recognition [25], [26]. The main
disadvantage of existing speech interfaces is their limited
robustness in the presence of high background noise, so several
electromyographic approaches have been developed in which
acoustic speech recognition is replaced by EMG-based automatic
speech recognition [2]. These approaches overcome ambient noise
and also offer an alternative means of human-computer
interaction for persons with speech disabilities [10]. SEMG is
also relevant to multimodal interfaces, which are used as a
means of communication and to reduce the information load in
human-human and human-agent systems. In space, both input and
output are limited by the severe atmospheric conditions, and
the acoustic signal is the most convenient way of communication
[27], [28]. The purpose of this article is to present a light
expression facial EMG approach in which electrodes work as
sensors, placed on the specific muscles to be examined, in
order to develop a recognition system.
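
As an illustration of the classification stage such a
recognition system needs, the sketch below fits a single
Gaussian probability density function per class and picks the
class with maximum likelihood, two ideas that appear in Table
I. It is not the method of any cited work: the diagonal
covariance, the variance floor, the class labels and the
feature values are all assumptions made for the example.

```python
import math


def fit_gaussians(training):
    """training maps a label to a list of feature vectors.
    Returns label -> (means, variances), one Gaussian per class
    with a diagonal covariance (assumed for simplicity)."""
    models = {}
    for label, vectors in training.items():
        n = len(vectors)
        dim = len(vectors[0])
        means = [sum(v[d] for v in vectors) / n for d in range(dim)]
        variances = [
            # Floor the variance to keep the log-likelihood finite.
            max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)
            for d in range(dim)
        ]
        models[label] = (means, variances)
    return models


def log_likelihood(x, means, variances):
    """Log density of x under an axis-aligned Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, means, variances)
    )


def classify(x, models):
    """Maximum likelihood decision over the fitted class models."""
    return max(models, key=lambda label: log_likelihood(x, *models[label]))


# Hypothetical two-channel RMS features for two vowel classes.
training = {
    "a": [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15]],
    "i": [[0.80, 0.90], [0.90, 0.80], [0.85, 0.85]],
}
models = fit_gaussians(training)
predicted = classify([0.12, 0.18], models)
```

A single Gaussian per class is the simplest choice; the
Gaussian mixture models and HMMs listed in Table I extend the
same likelihood comparison to richer densities and to
sequences.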

III. Conclusions 

Unspoken words recognition needs to be evaluated on purely
"silent" databases in which sound is neither vocalized nor
whispered. Specification of the articulators will clearly
reveal silent speech. Further improvement can be expected from
increasing the training data and employing an adaptive learning
scheme to improve system performance. System performance can
also be improved by enabling the addition of new words to the
word-based system and by investigating sources of noise, such
as electromagnetic waves, and the effects of gravity on the
speech signal. The optimal location of the electrodes should be
known, which requires a detailed analysis of the relationship
between the function of each facial muscle and the words
spoken.